by Adam M. Lang
January 22, 2021
MIT-Applied Data Science Bootcamp
Context
Climate change is a major issue in today's world, and global warming is accelerating.
Greenhouse gases were once absorbed, emitted, and regulated by Earth's natural systems, keeping temperatures in balance.
However, the industrial revolution, which began around 1760 and continues to the present day, has added suffocating, abnormal levels of greenhouse gases to Earth's atmosphere. CO2 is the predominant greenhouse gas emitted by humans, and thus the one causing the most damage and concern.
CO2 makes up about 84% or more of the greenhouse gases emitted by humans since the industrial revolution, accounting for roughly 30 billion tons of emissions per year. Levels are well above 400 ppm; the highest level recorded at the Mauna Loa Observatory in Hawaii, USA was 417 ppm in 2020. Before the industrial revolution, CO2 levels were at 270 ppm or lower, which was considered "safe". Climate scientists aim to reduce global CO2 levels to below 350 ppm; the problem is that this number has been climbing every year.
We know that fossil fuels are used extensively for electricity, transportation, industrial processes, and other human activities. Industrial natural gas CO2 emissions in the USA have risen the most since 2009, as seen in the graph above provided by the US Energy Information Administration. Ironically, the increasing use of natural gas has helped reduce overall US CO2 emissions, because it is the least carbon-intensive of the fossil fuels used for electricity generation and industrial process heating. However, it is still a concern, as it accounts for almost 40% of electric power consumption in the USA. In recent years, the drop in natural gas prices, coupled with highly efficient natural gas-fired combined-cycle technology, made natural gas an attractive choice to serve baseload demand previously met by coal-fired generation. Climate scientists are also concerned about another greenhouse gas that leaks into the atmosphere during natural gas production: methane. Methane has a warming effect up to 80 or 90 times more powerful than CO2 over a 20-year timescale.
Data Science Objective
We thus aim to forecast the CO2 emissions for natural gas (the least carbon-intensive fossil fuel), since closely monitoring the aforementioned rise in its use can help decrease the overall US CO2 footprint and thus reduce global warming.
Forecast the carbon emissions value for the natural gas (NNEIEUS) fuel type for the next 12 months (2016-07 to 2017-07) and propose measures that can be adopted as policies to reduce these emissions.
This is past monthly data of carbon dioxide emissions from electricity generation from the
US Energy Information Administration, categorized by fuel type, such as coal, natural gas, and more.
This dataset has been preprocessed and prepared for exploring time series techniques. The raw data is spread across many files and can be accessed at: https://www.eia.gov/electricity/data.php
# Upgrade the statsmodels package to use AutoReg and ARMA Models
!pip install statsmodels --upgrade
# To ignore the warnings
import warnings
warnings.filterwarnings('ignore')
#if using google colab
from google.colab import drive
drive.mount('/content/drive')
#import libraries
import pandas as pd
import numpy as np
from scipy import stats
#Importing libraries for visualization
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
import pylab
#Importing library for date manipulation
from datetime import datetime as dt
#import data
df = pd.read_excel('/content/drive/MyDrive/Data Set.xlsx')
#head of data
df.head()
#tail of data
df.tail()
#check value counts of MSN
df['MSN'].value_counts()
#df shape
df.shape
#number of rows per 9 energy types
5094/9
#check for missing values
df.isnull().sum().sum()
Preliminary observations:
#check df info
df.info()
Observations
#write function to parse integer date to datetime format
dateparse = lambda x: pd.to_datetime(x, format='%Y%m', errors = 'coerce')
df = pd.read_excel("/content/drive/MyDrive/Data Set.xlsx", parse_dates=['YYYYMM'], index_col='YYYYMM', date_parser=dateparse)
df.head()
#check tail
df.tail()
Now we have a proper datetime format with year and month.
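The effect of `errors='coerce'` in the date parsing can be seen on a toy example: any YYYYMM code that is not a valid year-month becomes `NaT`. (The codes below are made up for illustration; '197313' uses a month of 13, which some raw EIA files use for annual totals — an assumption here, not confirmed by this dataset.)

```python
import pandas as pd

# Hypothetical YYYYMM codes; '197313' has no valid calendar month
codes = pd.Series(['197301', '197313', '201507'])

# Same parsing as the dateparse lambda above
parsed = pd.to_datetime(codes, format='%Y%m', errors='coerce')
print(parsed)  # valid codes become Timestamps, the invalid one becomes NaT
```

Those `NaT` rows are exactly what the `notnull()` filter on the index removes in a later cell.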
#check datatypes again
df.info()
#dataframe of not null groups
df = df[pd.Series(pd.to_datetime(df.index, errors='coerce')).notnull().values]
df.head(15)
#convert value to float
df['Value'] = pd.to_numeric(df['Value'] , errors='coerce')
#check dtypes
df.dtypes
Successfully converted Value to float.
#check dataframe info
df.info()
#check null values
df.isnull().sum().sum()
It appears that 384 rows have missing values after converting Value to float. Let's take a closer look.
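Those NaNs come from `pd.to_numeric(..., errors='coerce')`: any entry in `Value` that is not parseable as a number is replaced with `NaN`. A minimal sketch (the 'Not Available' placeholder is a hypothetical example, not necessarily the literal string in the raw file):

```python
import pandas as pd

raw = pd.Series(['12.3', 'Not Available', '4.5'])  # made-up raw values
values = pd.to_numeric(raw, errors='coerce')       # non-numeric entries -> NaN

print(values.isnull().sum())  # one entry became NaN
```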
#identify null values in dataframe
df_nan = df.loc[df.isnull().any(axis=1)]
df_nan.head()
#check MSN data type to see if there are any missing values in the natural gas category
df_nan['MSN'].value_counts()
#drop na
df.dropna(inplace=True)
df.info()
Now we have an aligned dataset without NA values.
#describe data
df.describe()
df.shape
Observations
#categorical data describe
df.describe(include='object')
Observations
#group energy types
Energy_types = df.groupby('Description')
Energy_types.head()
#reset index to plot dataframe
df.reset_index(inplace=True)
df.head()
#multivariate time series plot
fuels = df.groupby('Description')
fig, ax = plt.subplots(figsize=(20,9))
for desc, group in fuels:
    group.plot(x='YYYYMM', y='Value', label=desc, ax=ax, title='Carbon Emissions per Fuel')
    group.plot(x='YYYYMM', y='Value', title=desc)
#fuels.plot(x='YYYYMM', y='Value')
Observations
#set index again
df.set_index('YYYYMM',inplace=True)
df.head()
#groupby description and value
CO2_per_source = df.groupby('Description')['Value'].sum().sort_values()
#check index of new dataframe
CO2_per_source.index
#shorten names of columns for plotting
cols = ['Geothermal Energy', 'Non-Biomass Waste', 'Petroleum Coke','Distillate Fuel ',
'Residual Fuel Oil', 'Petroleum', 'Natural Gas', 'Coal', 'Total Emissions']
#create bar plot of CO2 emissions by electric power sector
fig = plt.figure(figsize = (16,9))
x_label = cols
x_tick = np.arange(len(cols))
plt.bar(x_tick, CO2_per_source, align = 'center', alpha = 0.5)
fig.suptitle("CO2 Emissions by Electric Power Sector", fontsize= 25)
plt.xticks(x_tick, x_label, rotation = 70, fontsize = 20)
plt.yticks(fontsize = 20)
plt.ylabel('Carbon Emissions in MMT', fontsize = 20)
plt.show()
Observations
#create dataset
gas = df[df.Description.str.contains('Natural Gas Electric Power Sector CO2 Emissions',case=False)]
gas.head()
#check size of dataframe
gas.shape
#check datatypes
gas.dtypes
#drop unnecessary columns
gas = gas.drop(columns=['MSN','Description'])
gas.head()
#descriptive statistics
gas.describe()
Observations
#plot time series
gas.index
ax=gas['Value'].plot(figsize=(16,8), title='CO2 Emissions for Natural Gas')
ax.set(xlabel='Dates',ylabel='CO2 Emissions in Million Metric Tons of Carbon Dioxide')
plt.show()
Natural Gas Time Series appears to have seasonality based on this plot.
#lag plot
pd.plotting.lag_plot(gas['Value'])
plt.title('Lag plot for Natural Gas CO2 Emissions')
plt.show()
We will make a distribution plot to examine the distribution of the time series variables.
#Distribution plot
sns.distplot(gas['Value'])
plt.title('Distribution plot for CO2 Emission Values for Natural Gas')
plt.show()
We can see the distribution of the time series is skewed, which is consistent with the series not being stationary.
#probability plot
stats.probplot(gas['Value'],dist="norm",plot=pylab)
pylab.show()
The probability plot shows that the points do not lie directly on the red line, so the values are not normally distributed, which is consistent with a non-stationary series that continues to change over time.
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
plt.figure(figsize = (16,8))
plot_acf(gas['Value']) #pass the 1-D series, not the dataframe
plt.show()
plot_pacf(gas['Value'])
plt.show();
From the above plots we can see there is autocorrelation in the natural gas dataset: when CO2 emissions increase they tend to continue rising, and when they decrease they tend to continue falling.
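The idea of autocorrelation can be sketched directly with NumPy: a lag-1 autocorrelation near 1 means consecutive observations move together, as they do here; near 0 means no carry-over. The helper below is illustrative only, not part of the analysis.

```python
import numpy as np

def lag1_autocorr(x):
    """Correlation between the series and itself shifted by one step."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(42)
trending = np.cumsum(rng.normal(size=200))   # random walk: strongly autocorrelated
noise = rng.normal(size=200)                 # white noise: roughly uncorrelated

print(round(lag1_autocorr(trending), 2))  # close to 1
print(round(lag1_autocorr(noise), 2))     # close to 0
```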
gas.head()
# Calculating the rolling mean and standard deviation for a window of 12 observations
rolmean=gas.rolling(window=12).mean() #calculate the mean here
rolstd=gas.rolling(window=12).std() #calculate the standard deviation here
print(rolmean.head(15))
print('**************************************')
print('**************************************')
print(rolstd.head(15))
#Visualizing the rolling mean and standard deviation
plt.figure(figsize=(16,8))
actual = plt.plot(gas, color='blue', label='Actual Series') #fill the dataframe name
rollingmean = plt.plot(rolmean, color='red', label='Rolling Mean') #fill the dataframe name
rollingstd = plt.plot(rolstd, color='green', label='Rolling Std. Dev.') #fill the dataframe name
plt.title('Rolling Mean & Standard Deviation of the Natural Gas Time Series')
plt.xlabel('Years')
plt.ylabel('CO2 Emissions for Natural Gas')
plt.legend()
plt.show()
Observations
#Define a function to use adfuller test
def adfuller(gas):
    #Importing adfuller using statsmodels (the local import shadows this wrapper's name inside the function)
    from statsmodels.tsa.stattools import adfuller
    print('Dickey-Fuller Test: ')
    adftest = adfuller(gas['Value'])
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Observations'])
    for key,value in adftest[4].items():
        adfoutput['Critical Value (%s)'%key] = value
    print(adfoutput)
adfuller(gas)
The p-value is quite large at 0.99 (on the full series, before splitting into train and test), so the raw time series is NOT stationary and will have to be transformed prior to any modeling.
#Importing the seasonal_decompose to decompose the time series
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(gas) #Use the actual series to decompose
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
f, axs = plt.subplots(2,2,figsize=(15,15))
plt.subplot(411)
plt.plot(gas, label='Actual')
plt.legend()
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend()
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend()
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend()
plt.tight_layout()
#install ruptures package
!pip install ruptures
Other changepoint detection methods include PELT, binary segmentation, and dynamic programming. For our purposes, the window-based method will suffice.
#import ruptures package
import ruptures as rpt
#Convert the time series values to a numpy 1D array
changepoint1=np.array(gas['Value'])
#Changepoint detection with window-based search method
model = "l2"
algo = rpt.Window(width=40, model=model).fit(changepoint1)
my_bkps = algo.predict(n_bkps=10)
rpt.show.display(changepoint1, my_bkps, figsize=(10, 6))
plt.title('Change Point Detection: Window-Based Search Method')
plt.show()
Observations
#examine last changepoint
gas.iloc[450:500]
Observations
#examine previous changepoint
gas.iloc[325:400]
#check tail of dataframe
gas.tail()
In summary, we will include the last 15 years in our training data set, starting June 2000 and ending June 2015. Based on our changepoint analysis, we will be including two major changepoints: one between approximately 2000 and 2006, and another between approximately 2010 and 2014. Based on our outcomes, we could always consider adding or subtracting years. We do know from the EIA data description that industrial natural gas usage has increased the most since 2009, so it is good that this major trend is completely included in our training data.
# Splitting the data into train and test
#split based on date
df_train = gas.loc['2000-06-01':'2015-06-01']
df_test = gas.loc['2015-07-01' : '2016-07-01']
print(df_train)
print(df_test)
#install pmdarima library
!pip install pmdarima
#install fbprophet
#!pip install fbprophet
#import statistics libraries
import statsmodels.api as sm
import scipy
from scipy.stats import anderson
from statsmodels.tools.eval_measures import rmse
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import month_plot, seasonal_plot, plot_acf, plot_pacf, quarter_plot
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import ExponentialSmoothing, SimpleExpSmoothing
from statsmodels.stats.diagnostic import acorr_ljungbox as ljung
from statsmodels.tsa.statespace.tools import diff as diff
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pmdarima as pm
from pmdarima import ARIMA, auto_arima
from scipy import signal
from scipy.stats import shapiro
from scipy.stats import boxcox
from scipy.special import inv_boxcox
from sklearn.preprocessing import StandardScaler
from scipy.stats import jarque_bera as jb
from itertools import combinations
If the data has no outliers, RMSE (Root Mean Square Error) is a good metric to use. %MAPE (Mean Absolute Percentage Error) provides a more intuitive understanding, as it is expressed as a percentage. We do not use %MAPE if the series is intermittent, to avoid division by zero.
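As a sketch of the two metrics on illustrative values (not from the dataset):

```python
import numpy as np

def rmse_metric(actual, pred):
    """Root mean squared error, in the same units as the data."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return np.sqrt(np.mean((actual - pred) ** 2))

def mape_metric(actual, pred):
    """Mean absolute percentage error; undefined if actual contains zeros."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return np.mean(np.abs((actual - pred) / actual)) * 100

actual = [100.0, 120.0, 80.0]
pred = [110.0, 115.0, 90.0]

print(round(rmse_metric(actual, pred), 2))  # 8.66
print(round(mape_metric(actual, pred), 2))  # 8.89
```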
#define function for calculating RMSE and %MAPE
def accuracy(y1,y2):
    accuracy_df=pd.DataFrame()
    rms_error = np.round(rmse(y1, y2),1)
    map_error = np.round(np.mean(np.abs((np.array(y1) - np.array(y2)) / np.array(y1))) * 100,1)
    accuracy_df=accuracy_df.append({"RMSE":rms_error, "%MAPE": map_error}, ignore_index=True)
    return accuracy_df
# Calculating the rolling mean and standard deviation for a window of 12 observations
rolmean=df_train.rolling(window=12).mean()
rolstd=df_train.rolling(window=12).std()
#Visualizing the rolling mean and standard deviation
#plot actual value and plot rolling/mean value
plt.figure(figsize=(16,8))
actual = plt.plot(df_train, color='blue', label='Actual Series')
rollingmean = plt.plot(rolmean, color='red', label='Rolling Mean')
rollingstd = plt.plot(rolstd, color='green', label='Rolling Std. Dev.')
plt.title('Rolling Mean & Standard Deviation of the Series - Training Data')
plt.legend()
plt.show()
Observations of training dataset
We can also use the Augmented Dickey-Fuller (ADF) test to check whether the series is stationary. The hypotheses for the ADF test are defined as:
Null hypothesis (H0): the series has a unit root (it is non-stationary).
Alternative hypothesis (H1): the series is stationary.
If the p-value is below the chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude the series is stationary.
#Define a function to use adfuller test
def adfuller(df_train):
    #Importing adfuller using statsmodels
    from statsmodels.tsa.stattools import adfuller
    print('Dickey-Fuller Test: ')
    adftest = adfuller(df_train['Value'])
    adfoutput = pd.Series(adftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Observations'])
    for key,value in adftest[4].items():
        adfoutput['Critical Value (%s)'%key] = value
    print(adfoutput)
adfuller(df_train)
Observations
We can use some of the following methods to make a non-stationary series stationary: a log (or other power) transformation to stabilize the variance, differencing against a moving average or a lagged copy of the series to remove trend, and seasonal decomposition to isolate and remove seasonality.
# Visualize the rolling mean and standard deviation after using log transformation
plt.figure(figsize=(16,8))
df_log = np.log(df_train)
MAvg = df_log.rolling(window=12).mean()
MStd = df_log.rolling(window=12).std()
plt.plot(df_log)
plt.plot(MAvg, color='r', label = 'Moving Average')
plt.plot(MStd, color='g', label = 'Standard Deviation')
plt.legend()
plt.show()
#check adfuller test
adfuller(df_log)
Observations:
The log transformation made the mean and standard deviation steadier; however, both are still changing and trending upwards, so we have not achieved full stationarity. The adfuller test also shows a p-value of 0.896, which still does not indicate stationarity.
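Why the log transform helps can be seen on a toy series with purely multiplicative growth (an assumption chosen for illustration): the raw step-to-step differences grow with the level of the series, while the log differences are constant.

```python
import numpy as np

# Toy series growing 5% per step
s = 100 * 1.05 ** np.arange(50)

raw_diffs = np.diff(s)           # spread grows with the level of the series
log_diffs = np.diff(np.log(s))   # constant at log(1.05)

print(round(raw_diffs.std(), 3))   # large spread
print(log_diffs.std())             # essentially zero
```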
Let's use the differencing method over the series to remove temporal dependence (trend) and again check the stationarity of the series.
# Visualize the rolling mean and standard deviation after using log transform and differencing
plt.figure(figsize=(16,8))
df_log_diff = df_log - MAvg
MAvg_diff = df_log_diff.rolling(window=12).mean() #calculate the mean
MStd_diff = df_log_diff.rolling(window=12).std() #calculate the standard deviation
plt.plot(df_log_diff) #plot the dataframe with differencing
plt.plot(MAvg_diff, color='r', label = 'Moving Average') #plot the moving average of the dataframe with differencing
plt.plot(MStd_diff, color='g', label = 'Standard Deviation') #plot the standard deviation of the dataframe with differencing
plt.legend()
plt.show()
#Dropping the null values introduced by the differencing method
df_log_diff = df_log_diff.dropna()
#check adfuller test
adfuller(df_log_diff)
Observations
Let's shift the series by order 1 (or by 1 month) & apply differencing (using lagged series) and then check the rolling mean and standard deviation.
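Shifting by one period and subtracting is exactly what pandas' built-in `diff` does; a quick check on a toy series:

```python
import pandas as pd

s = pd.Series([3.0, 5.0, 9.0, 8.0])

manual = (s - s.shift(periods=1)).dropna()   # explicit lag-1 differencing
builtin = s.diff().dropna()                  # pandas shorthand for the same thing

print(manual.tolist())  # [2.0, 4.0, -1.0]
```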
plt.figure(figsize=(16,8))
df_shift = df_log - df_log.shift(periods = 1)
MAvg_shift = df_shift.rolling(window=12).mean() #calculate the mean
MStd_shift = df_shift.rolling(window=12).std() #calculate the standard deviation
plt.plot(df_shift, color='c') #plot the dataframe with lag
plt.plot(MAvg_shift, color='red', label = 'Moving Average') #plot the moving average of the dataframe with lag
plt.plot(MStd_shift, color='green', label = 'Standard Deviation') #plot the standard deviation of the dataframe with lag
plt.legend()
plt.show()
#Dropping the null values introduced by the differencing method
df_shift = df_shift.dropna()
#check adfuller test
adfuller(df_shift)
Observations
By shifting the series by order 1 (1 month) and differencing against the lagged series, we achieved stationarity: the mean and standard deviation are now significantly smoother, and the p-value of 0.000959 confirms that the series is stationary.
#Importing the seasonal_decompose to decompose the time series
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(gas) #Use the actual series to decompose
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
f, axs = plt.subplots(2,2,figsize=(15,15))
plt.subplot(411)
plt.plot(gas, label='Actual')
plt.legend()
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend()
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend()
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend()
plt.tight_layout()
Observations:
#Importing acf and pacf functions
from statsmodels.tsa.stattools import acf, pacf
#define function
def pacf_acf_plot(df_shift):
    #Using 14 lags in the series
    lag_acf = acf(df_shift, nlags=14, fft=False)
    lag_pacf = pacf(df_shift, nlags=14, method='ols')
    plt.figure(figsize=(16,8))
    plt.subplot(121)
    plt.plot(lag_acf)
    plt.axhline(y=0,linestyle='--',color='red')
    plt.axhline(y=-1.96/np.sqrt(len(df_shift)),linestyle='--',color='red')
    plt.axhline(y=1.96/np.sqrt(len(df_shift)),linestyle='--',color='red')
    plt.title('Autocorrelation Function (acf)')
    plt.subplot(122)
    plt.plot(lag_pacf)
    plt.axhline(y=0,linestyle='--',color='red')
    plt.axhline(y=-1.96/np.sqrt(len(df_shift)),linestyle='--',color='red')
    plt.axhline(y=1.96/np.sqrt(len(df_shift)),linestyle='--',color='red')
    plt.title('Partial Autocorrelation Function (pacf)')
    plt.show()
pacf_acf_plot(df_shift)
We will try another way to plot the acf and pacf to confirm the values, although the line appears to enter the confidence bands between lags 1 and 2.
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
def acf_pacf(df_shift):
    plt.figure(figsize = (16,8))
    plot_acf(df_shift)
    plt.show()
    plot_pacf(df_shift)
    plt.show();
#call function
acf_pacf(df_shift)
Observations:
We can see that in both plots the blue line enters the confidence interval (dotted red) at lag 1, and the most prominent spike is at lag 1; therefore we will use p=1 and q=1. However, it is very close to lag 2, so 2 might work as well.
#Importing AutoReg function to apply AR model
from statsmodels.tsa.ar_model import AutoReg
#Comparing the actual & predicted series by AR model and calculating the Residual Sum of Squares
df_shift.index.freq='MS'
plt.figure(figsize=(16,8))
model_AR = AutoReg(df_shift, lags=2) #Using number of lags as 2
results_AR = model_AR.fit()
plt.plot(df_shift) #Visualizing the actual series used for modelling
predict = results_AR.predict(start=0,end=len(df_shift)-1)
predict = predict.fillna(0) #Converting NaN values to 0
plt.plot(predict, color='red')
plt.title('AR Model - RSS: %.4f'% sum((predict-df_shift['Value'])**2)) #calculate the residual sum of squares
plt.show()
#print results of model
print(results_AR.summary())
Observations:
We will use the ARMA model with p=0 so that it works as an MA model.
#Importing ARMA
from statsmodels.tsa.arima_model import ARMA
#Comparing the actual & predicted series by MA model and calculating the Residual Sum of Squares
plt.figure(figsize=(16,8))
df_shift.index.freq='MS'
model_MA = ARMA(df_shift, order=(0,2)) #Using p=0 and q=2
results_MA = model_MA.fit()
plt.plot(df_shift) #Visualzing the actual series used for modelling
plt.plot(results_MA.fittedvalues, color='red')
plt.title('MA Model - RSS: %.4f'% sum((results_MA.fittedvalues-df_shift['Value'])**2)) #calculate the residual sum of squares
plt.show()
#print results of model
print(results_MA.summary())
Observations:
We will use p=2 and q=2 as inferred from the acf and pacf plots.
from statsmodels.tsa.arima_model import ARMA
#Comparing the actual & predicted series by AR model and calculating the Residual Sum of Squares
plt.figure(figsize=(16,8))
df_shift.index.freq='MS'
model = ARMA(df_shift, order=(2,2)) #Using p=2, q=2
results = model.fit()
plt.plot(df_shift) #Visualizing the actual series used for modelling
plt.plot(results.fittedvalues, color='red')
plt.title('ARMA Model - RSS: %.4f'% sum((results.fittedvalues-df_shift['Value'])**2)) #calculate the residual sum of squares
plt.show()
#print results of model
print(results.summary())
#Printing the fitted values
predictions=pd.Series(results.fittedvalues)
predictions
Observations:
Since we now have the fitted values from the ARMA model, we will apply the inverse transformation to recover the original scale.
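The inverse transformation (cumulative sum of the differences, plus the first log value, then exponentiation) exactly undoes the log-and-shift transform; a round-trip check on a toy series standing in for the emissions values:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])   # toy stand-in for the data

log_s = np.log(s)
diffs = (log_s - log_s.shift(1)).dropna()       # the modelling transform

# Inverse: cumsum of differences + first log value, then exponentiate
recovered = np.exp(diffs.cumsum() + log_s.iloc[0])

print(np.allclose(recovered.values, s.values[1:]))  # True
```

The first observation is lost to the differencing, which is why the recovered series starts at the second value.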
#First step - doing cumulative sum
predictions_cumsum = predictions.cumsum() #Use the predicted values series
predictions_cumsum
#Second step - Adding the first value of the log series to the cumulative sum values
predictions_log = pd.Series(df_log['Value'].iloc[0], index=df_log.index)
predictions_log = predictions_log.add(predictions_cumsum, fill_value=0) #Use the series with cumulative sum
predictions_log
#Third step - applying exponential transformation
predictions_ARMA = np.exp(predictions_log) #Use the series with log values
predictions_ARMA
#Plotting the original vs predicted series
plt.figure(figsize=(16,8))
plt.plot(df_train, color = 'c', label = 'Original Series')
plt.plot(predictions_ARMA, color = 'r', label = 'Predicted Series') #Use the series with predicted values
plt.title('Actual vs Predicted')
plt.xlabel("Years")
plt.ylabel("CO2 Emissions in MMT")
plt.legend()
plt.show()
Observations:
In order to forecast the values for the next 12 months (13 monthly points, July 2016 through July 2017), we need to follow these steps: forecast on the differenced log scale, take the cumulative sum of the forecasts, add the last value of the training log series, and apply the exponential transformation to return to the original scale.
#Forecasting the values for the next year (13 monthly steps: 2016-07 through 2017-07)
forecasted_ARMA = results.forecast(steps=13) #steps is the number of months to forecast
forecasted_ARMA[0]
# Creating a list containing all the forecasted value
list1 = forecasted_ARMA[0].tolist()
series1 = pd.Series(list1)
series1
cumsum1 = series1.cumsum()
cumsum1
#Making a new dataframe to get the additional dates from July 2016 to July 2017
index = pd.date_range('2016-07-01','2017-08-01' , freq='1M')
df1 = pd.DataFrame()
df1['cumsum1'] = cumsum1
df1.index = index
df1
df_log.tail()
#Adding the last term of the train log series in the cumulative values
df1['forecasted'] = df1['cumsum1'] + float(df_log.loc['2015-06-01'])
df1
#Applying exponential transformation to the forecasted log values
forecasted_ARMA = np.exp(df1['forecasted'])
forecasted_ARMA
#plot forecast for 2016-2017
forecasted_ARMA.plot(figsize=(16,8),color="green")
plt.title("ARMA Model CO2 Emissions Forecast 2016-2017 for Natural Gas")
plt.xlabel("Year")
plt.ylabel("CO2 emissions in MMT")
plt.show()
#total CO2 emissions for next year
forecasted_ARMA.sum()
plt.figure(figsize = (16,8))
plt.plot(df_train)
plt.plot(forecasted_ARMA, label = 'predicted')
plt.title("Actual vs. Predicted Forecast (ARMA Model) 2016-2017")
plt.xlabel("Year")
plt.ylabel("CO2 Emissions in MMT")
plt.show()
#calculate mean squared error first
from sklearn.metrics import mean_squared_error
error = mean_squared_error(forecasted_ARMA, df_test) #squared=True (default) gives the MSE; RMSE is computed below
error
#now calculate RMSE
from math import sqrt
rmse = sqrt(error)
print('RMSE: %f' % rmse)
#define function for MAPE calculation
import numpy as np
def mape(actual, pred):
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100
mape(df_test,forecasted_ARMA)
#create accuracy df
d={'ARMA RMSE':[2.565], 'ARMA MAPE':[14.676]}
accuracy_df=pd.DataFrame(data=d,index=['Values'])
accuracy_df
Observations
First, run auto_arima to find the best values of p, d, q, P, D, Q.
#Use auto arima to find best values for sarima model
#Finding the best values for p,q,d,P,Q,D
auto_arima(df_train['Value'], start_p = 1, start_q = 1,
           max_p = 3, max_q = 3, m = 12,
           start_P = 0, seasonal = True,
           d = None, D = 1, trace = True,
           error_action = 'ignore',   # we don't want to know if an order does not work
           suppress_warnings = True,  # we don't want convergence warnings
           stepwise = True).summary() # set to stepwise
Observations: As seen in the auto_arima results above, the best AIC obtained was 793.532 with order (1,0,0) and seasonal order (2,1,0,12). We will use these for the SARIMA model building.
# Train and fit the SARIMA model
model1 = SARIMAX(df_train['Value'],
                 order = (1,0,0),
                 seasonal_order = (2,1,0,12)) #seasonal order based on auto arima model
sarima_fit = model1.fit()
# Forecast for the next year + 2 for comparison
forecast1 = sarima_fit.predict(start = len(df_train),
                               end = (len(df_train)-1) + 3 * 12,
                               type = 'levels').rename('Forecast')
# Plot the forecast values
df_train['Value'].plot(figsize = (12, 5), legend = True)
df_train.index.freq="MS"
forecast1.plot(legend = True)
#title and labels of plot
plt.title('Actual vs. Predicted - SARIMA Forecast for CO2 Emissions')
plt.ylabel('CO2 Emissions in MMT')
plt.xlabel("Years")
plt.show();
#model results
sarima_fit.summary()
Observations
#plot diagnostics of SARIMA
sarima_fit.plot_diagnostics(figsize=(16,8));
plt.show()
Observations
#sarima model forecast
print(f'SARIMA Model Forecast')
forecast_values1 = sarima_fit.get_forecast(steps=25)
fv_df1 = forecast_values1.summary_frame()
fv_df1
#isolate 2016-2017 sarima forecast
sarima_fcast = fv_df1.tail(13)
sarima_fcast = sarima_fcast['mean']
sarima_fcast
#plot 2016-2017 forecast for sarima model
sarima_fcast.plot(figsize=(16,8),color="green")
plt.title("SARIMA Model CO2 Emissions Forecast for 2016-2017 for Natural Gas")
plt.xlabel("Year")
plt.ylabel("CO2 emissions in MMT")
plt.show();
#total CO2 emissions for next year
sarima_fcast.sum()
print('SARIMA Model Forecast: Predicted Mean')
pred_mean1 = pd.DataFrame(forecast_values1.predicted_mean)
pred_mean1
print(f'SARIMA Model Predicted Mean Sum by Year')
pred_mean1.resample('Y').sum()
#get confidence interval
forecast_ci1 = forecast_values1.conf_int(alpha=.05)
forecast_ci1.head()
print(f'SARIMA Model Forecast Upper and Lower CO2 Emissions Confidence Interval')
conf_int1 = forecast_values1.conf_int(alpha=.05)
conf_int1
print(f'Yearly Totals for Lower and Upper CO2 Emissions Forecast for SARIMA')
conf_int1.resample('Y').sum()
#plot confidence intervals
ax = df_train.plot(figsize=(12,8))
#Plot the Forecasted Values
forecast_values1.predicted_mean.plot(ax=ax,label='Forecasts')
#Plot the confidence Intervals
ax.fill_between(forecast_ci1.index,
                forecast_ci1.iloc[:,0],
                forecast_ci1.iloc[:,1], color='g', alpha=.5)
#axes labels and titles
ax.set_xlabel('Time')
ax.set_ylabel('CO2 Emissions in MMT')
ax.set_title('SARIMA Forecast with Confidence Intervals for July 2016-July 2017 for Natural Gas CO2 Emissions')
plt.legend()
plt.show()
#calculate mean squared error first
from sklearn.metrics import mean_squared_error
error2 = mean_squared_error(sarima_fcast, df_test) #squared=True (default) gives the MSE; RMSE is computed below
error2
#now calculate RMSE
from math import sqrt
rmse = sqrt(error2)
print('RMSE: %f' % rmse)
#function for mape
import numpy as np
def mape(actual, pred):
    actual, pred = np.array(actual), np.array(pred)
    return np.mean(np.abs((actual - pred) / actual)) * 100
#mape result
mape(df_test,sarima_fcast)
#add to accuracy_df
accuracy_df.insert(2,'SARIMA RMSE',[2.641])
accuracy_df.insert(3,'SARIMA MAPE',[20.176])
accuracy_df
#early view of first 2 models
accuracy_df.T
Observations so far:
We will use the Holt-Winters model, as it is excellent at handling trend and seasonality, both of which this data exhibits significantly.
from statsmodels.tsa.holtwinters import ExponentialSmoothing
triple_model = ExponentialSmoothing(df_train['Value'],trend='add',seasonal='add',seasonal_periods=12,freq="MS").fit()
test_predictions = triple_model.forecast(25).rename('TES Forecast')
#print predictions
print(test_predictions)
#get forecast for 2016-2017 CO2 emissions
exp_smooth = test_predictions.tail(13)
exp_smooth
#plot exponential smoothing forecast 2016-2017
exp_smooth.plot(figsize=(16,8),color="green")
plt.title("Exponential Smoothing Model CO2 Emissions Forecast 2016-2017 for Natural Gas")
plt.xlabel("Year")
plt.ylabel("CO2 emissions in MMT")
plt.show();
#total CO2 emissions for next year
exp_smooth.sum()
#plot predictions
df_train['Value'].plot(legend=True,label='TRAIN')
df_test['Value'].plot(legend=True,label='TEST',figsize=(12,8))
test_predictions.plot(legend=True,label='PREDICTION')
plt.title('Exponential Smoothing Model: Actual vs. Predicted for 2016-2017 Forecast')
plt.xlabel('Years')
plt.ylabel("CO2 Emissions in MMT")
plt.show();
#calculate mean squared error first
from sklearn.metrics import mean_squared_error
error3 = mean_squared_error(exp_smooth, df_test) #squared=True (default) gives the MSE; RMSE is computed below
error3
#now calculate RMSE
from math import sqrt
rmse = sqrt(error3)
print('RMSE: %f' % rmse)
#mape test
mape(df_test,exp_smooth)
#add to accuracy_df
accuracy_df.insert(2,'Exp Smooth RMSE',[1.296])
accuracy_df.insert(3,'Exp Smooth MAPE',[17.511])
accuracy_df
Observations so far:
#install pystan and fbprophet
!pip install pystan
!pip install fbprophet
df_pr = df_train.copy()
df_pr = df_train.reset_index()
df_pr.columns = ['ds','y'] #Prophet requires the columns to be named ds and y
from fbprophet import Prophet
m = Prophet(interval_width=0.95,yearly_seasonality=True, changepoint_range=0.95, changepoint_prior_scale= 0.080 )
m.fit(df_pr)
future = m.make_future_dataframe(periods=26,freq='M')
prophet_pred = m.predict(future)
prophet_pred.tail()
#get forecast
forecastdf = prophet_pred[['ds','trend','yhat','yhat_lower','yhat_upper']]
forecastdf.tail(24)
forecastdf = pd.DataFrame(forecastdf)
forecastdf
#change names and set index
forecastdf = forecastdf.rename(columns={'ds':'Month','trend':'Trend','yhat':'Predicted Value','yhat_lower':'Lower_Prediction','yhat_upper':'Upper_Prediction'})
#set index
forecastdf.set_index('Month',inplace=True)
#forecastdf
forecastdf
#monthly co2 emissions predictions for prophet
resamp_month_prophet = forecastdf.resample('M').sum()
resamp_month_prophet.tail(13)
#get forecast for 2016-2017
prophet_co2 = resamp_month_prophet['Predicted Value'].tail(13)
prophet_co2
#plot 2016-2017 CO2 emissions forecast
prophet_co2.plot(figsize=(10,5),color="green")
plt.title("Prophet Model CO2 Emissions Forecast for Natural Gas: 2016-2017")
plt.xlabel("Year")
plt.ylabel("CO2 Emissions in MMT")
plt.show();
#CO2 total forecast for next year
prophet_co2.sum()
#plot predictions with changepoints
from fbprophet.plot import add_changepoints_to_plot
fig1 = m.plot(prophet_pred)
a = add_changepoints_to_plot(fig1.gca(), m, prophet_pred)
plt.title("Prophet Forecast for July 2016-July 2017 with Changepoints")
plt.xlabel("Date")
plt.ylabel("CO2 Emissions in MMT")
plt.show();
Observations
#plot trend
fig2 = m.plot_components(prophet_pred)
Observations: The trend component of the Prophet model shows the exponential growth that has continued in CO2 emissions.
#calculate RMSE (squared=False makes mean_squared_error return the root mean squared error directly)
from sklearn.metrics import mean_squared_error
error4 = mean_squared_error(prophet_co2, df_test, squared=False)
print('RMSE: %f' % error4)
#mape test
mape(df_test,prophet_co2)
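The `mape` helper used above was defined earlier in the notebook. A minimal pure-Python sketch of the usual definitions (MAPE returned as a percent, and RMSE equivalent to sklearn's `mean_squared_error(..., squared=False)`) looks like this; the notebook's exact implementation may differ:

```python
import math

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a percent."""
    pairs = list(zip(y_true, y_pred))
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

def rmse(y_true, y_pred):
    """Root mean squared error -- the same value sklearn returns
    from mean_squared_error(y_true, y_pred, squared=False)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

actual = [50.0, 40.0, 25.0]
predicted = [55.0, 38.0, 25.0]
print(rmse(actual, predicted))  # sqrt((25 + 4 + 0) / 3) ≈ 3.109
print(mape(actual, predicted))  # → 5.0
```

Note that because `squared=False` already yields the RMSE, taking `sqrt` of that value a second time would understate the error.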
#add to accuracy_df
accuracy_df.insert(2,'Prophet RMSE',[2.328])
accuracy_df.insert(3,'Prophet MAPE',[19.762])
accuracy_df
Observations
#examine natural gas totals for last 5 years
gas.Value.resample('Y').sum().tail()
#plot last year of training data
gas.tail(12).plot(figsize=(16,8))
plt.title("Natural Gas CO2 Emissions: last year of training data")
plt.xlabel("Year")
plt.ylabel("CO2 Emissions in MMT")
plt.show();
#final accuracy dataframe
accuracy_df.T
#Sum of natural gas CO2 emissions per year in data set
gas['Value'].tail(31).resample('Y').sum()
The raw historical CO2 emissions for natural gas in the electricity sector, July to July, are as follows (courtesy of EIA.gov):
2014-2015: 538.3
2015-2016: 607.8
2016-2017: 571.5
2017-2018: 609.6
These July-to-July totals differ from the calendar-year totals above, but for our purposes we compare our predictions against the raw historical figure for July 2016-July 2017 of 571.5 MMT in CO2 emissions. Our closest prediction to this is the Prophet model at 571.440 MMT.
#import sklearn timeseries split
from sklearn.model_selection import TimeSeriesSplit
#Specify fold and perform splitting
tscv = TimeSeriesSplit(n_splits=3)
tscv.split(gas)
#Find out no of observations in train and test sets
i = 0
for train, test in tscv.split(gas):
    i = i + 1
    print("No of observations under train%s=%s" % (i, len(train)))
    print("No of observations under test%s=%s" % (i, len(test)))
#total sum of dataset
393+130
Now we will split the data into three training sets (train1, train2, train3) and three test sets (test1, test2, test3) using iloc, matching the fold boundaries found above.
#Splitting according to the above description
train1, test1 = gas.iloc[:133, 0], gas.iloc[133:263, 0]
train2, test2 = gas.iloc[:263, 0], gas.iloc[263:393, 0]
train3, test3 = gas.iloc[:393, 0], gas.iloc[393:523, 0]
So we now have different windows of the dataset in time series succession.
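As a sanity check on those window boundaries: TimeSeriesSplit with n_splits=3 on 523 monthly observations uses a default test size of 523 // (3 + 1) = 130, which reproduces the iloc boundaries used above. A small pure-Python sketch of that fold arithmetic (no sklearn needed):

```python
n_obs = 523      # observations in the gas series
n_splits = 3
test_size = n_obs // (n_splits + 1)  # 130

# Each fold trains on everything before the test window
for i in range(n_splits):
    train_end = n_obs - (n_splits - i) * test_size  # 133, 263, 393
    print(f"fold {i+1}: train=[0:{train_end}) test=[{train_end}:{train_end + test_size})")
```

This mirrors sklearn's default expanding-window behavior: each training set grows while the test windows stay a fixed 130 observations.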
#Fit Triple Exp Smoothing Model
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error
from math import sqrt
#First fold RMSE
model1_es = ExponentialSmoothing(train1, trend='add',seasonal='add', seasonal_periods=12,freq='MS').fit()
pred1_es = model1_es.predict(start=test1.index[0], end=test1.index[-1])
RMSE1_es=round(sqrt(mean_squared_error(test1, pred1_es)),2)
MAPE1_es = round(mape(test1,pred1_es),2)
#Second fold RMSE
model2_es = ExponentialSmoothing(train2, trend='add', seasonal='add', seasonal_periods=12,freq='MS').fit()
pred2_es = model2_es.predict(start=test2.index[0], end=test2.index[-1])
RMSE2_es=round(sqrt(mean_squared_error(test2, pred2_es)),2)
MAPE2_es = round(mape(test2, pred2_es),2)
#Third fold RMSE
model3_es = ExponentialSmoothing(train3, trend='add', seasonal='add', seasonal_periods=12,freq='MS').fit()
pred3_es = model3_es.predict(start=test3.index[0], end=test3.index[-1])
RMSE3_es=round(sqrt(mean_squared_error(test3, pred3_es)),2)
MAPE3_es= round(mape(test3, pred3_es),2)
print ("RMSE1 for Triple Exp Smoothing:", RMSE1_es)
print ("RMSE2 for Triple Exp Smoothing:", RMSE2_es)
print ("RMSE3 for Triple Exp Smoothing:", RMSE3_es)
Overall_RMSE_es=round((RMSE1_es+RMSE2_es+RMSE3_es)/3,2)
print ("Overall RMSE for Triple Exp Smoothing:", Overall_RMSE_es)
print(" ")
print(" ")
print ("MAPE1 for Triple Exp Smoothing:", MAPE1_es)
print("MAPE2 for Triple Exp Smoothing:", MAPE2_es)
print("MAPE3 for Triple Exp Smoothing:", MAPE3_es)
Overall_MAPE_es=round((MAPE1_es+MAPE2_es+MAPE3_es)/3,2)
print ("Overall MAPE for Triple Exp Smoothing:", Overall_MAPE_es)
#plot
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
%matplotlib inline
plt.style.use('fivethirtyeight')
from pylab import rcParams
rcParams['figure.figsize'] = 14,8
#Labels and titles
plt.xlabel("Time")
plt.ylabel("CO2 Emissions in MMT")
plt.title("3 Fold Sliding Window Cross-Validation for Triple Exponential Smoothing")
#First fold- CV
plt.plot(train1.index, train1, label='Train1')
plt.plot(test1.index, test1, label='Test1')
plt.plot(pred1_es.index, pred1_es, label='Triple Exp Smoothing prediction1')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test1.index[0], test1.index[-1], facecolor='g', alpha=0.1)
#Second fold
plt.plot(train2.index, train2, label='Train2')
plt.plot(test2.index, test2, label='Test2')
plt.plot(pred2_es.index, pred2_es, label='Triple Exp Smoothing prediction2')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test2.index[0], test2.index[-1], facecolor='r', alpha=0.2)
#Third fold
plt.plot(test3.index, test3, label='Test3')
plt.plot(pred3_es.index, pred3_es, label='Triple Exp Smoothing prediction3')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test3.index[0], test3.index[-1], facecolor='b', alpha=0.3);
Observations
Before we perform cross-validation we will run the auto_arima function to find the best parameters for the SARIMA model on each training fold.
#define function for auto arima hyperparameter search
#Use auto arima to find best values for sarima model
#Finding the best values for p,q,d,P,Q,D
def auto(x):
    return auto_arima(x, start_p = 1, start_q = 1,
                      max_p = 3, max_q = 3, m = 12,
                      start_P = 0, seasonal = True,
                      d = None, D = 1, trace = True,
                      error_action ='ignore', # skip orders that fail to fit
                      suppress_warnings = True, # silence convergence warnings
                      stepwise = True).summary() # stepwise search
#parameter search train1
auto(train1)
For train1 the best parameters are: order=(1,0,0) and seasonal_order=(0,1,1,12)
#parameter search train 2
auto(train2)
The best parameters for train2 are: order=(2,1,2), seasonal_order=(0,1,1,12)
#parameter search for train3
auto(train3)
The best fit for train3 is: order=(3,1,3), seasonal_order=(0,1,1,12)
#Fit SARIMA model
#First fold RMSE
model1_sarima = SARIMAX(train1, order = (1,0,0),
seasonal_order =(0,1,1, 12),freq='MS').fit()
pred1_sarima = model1_sarima.predict(start=test1.index[0], end=test1.index[-1])
RMSE1_sarima=round(sqrt(mean_squared_error(test1, pred1_sarima)),2)
MAPE1_sarima = round(mape(test1,pred1_sarima),2)
#Second fold RMSE
model2_sarima = SARIMAX(train2, order=(2,1,2), seasonal_order=(0,1,1,12),freq='MS').fit()
pred2_sarima = model2_sarima.predict(start=test2.index[0], end=test2.index[-1])
RMSE2_sarima=round(sqrt(mean_squared_error(test2, pred2_sarima)),2)
MAPE2_sarima = round(mape(test2, pred2_sarima),2)
#Third fold RMSE
model3_sarima = SARIMAX(train3, order=(3,1,3), seasonal_order=(0,1,1,12),freq='MS').fit()
pred3_sarima = model3_sarima.predict(start=test3.index[0], end=test3.index[-1])
RMSE3_sarima=round(sqrt(mean_squared_error(test3, pred3_sarima)),2)
MAPE3_sarima= round(mape(test3, pred3_sarima),2)
print ("RMSE1 for SARIMA:", RMSE1_sarima)
print ("RMSE2 for SARIMA:", RMSE2_sarima)
print ("RMSE3 for SARIMA:", RMSE3_sarima)
Overall_RMSE_sarima=round((RMSE1_sarima+RMSE2_sarima+RMSE3_sarima)/3,2)
print ("Overall RMSE for SARIMA:", Overall_RMSE_sarima)
print(" ")
print(" ")
print ("MAPE1 for SARIMA:", MAPE1_sarima)
print("MAPE2 for SARIMA:", MAPE2_sarima)
print("MAPE3 for SARIMA:", MAPE3_sarima)
Overall_MAPE_sarima=round((MAPE1_sarima+MAPE2_sarima+MAPE3_sarima)/3,2)
print ("Overall MAPE for SARIMA:", Overall_MAPE_sarima)
#plot CV for sarima
#Labels and titles
plt.xlabel("Time")
plt.ylabel("CO2 Emissions in MMT")
plt.title("3 Fold Sliding Window Cross-Validation for SARIMA Model")
#First fold- CV
plt.plot(train1.index, train1, label='Train1')
plt.plot(test1.index, test1, label='Test1')
plt.plot(pred1_sarima.index, pred1_sarima, label='SARIMA prediction1')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test1.index[0], test1.index[-1], facecolor='g', alpha=0.1)
#Second fold
plt.plot(train2.index, train2, label='Train2')
plt.plot(test2.index, test2, label='Test2')
plt.plot(pred2_sarima.index, pred2_sarima, label='SARIMA prediction2')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test2.index[0], test2.index[-1], facecolor='r', alpha=0.2)
#Third fold
plt.plot(test3.index, test3, label='Test3')
plt.plot(pred3_sarima.index, pred3_sarima, label='SARIMA prediction3')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test3.index[0], test3.index[-1], facecolor='b', alpha=0.3);
Observations
First we will figure out the p,q values using the pacf_acf_plot function we wrote earlier.
#train1
pacf_acf_plot(train1)
For train1, the PACF and ACF appear to enter the confidence bands at lags 2 and 2, suggesting an order of (2,2).
#train 2
pacf_acf_plot(train2)
For train2 they appear to enter the confidence bands at lags 3 and 2, suggesting (3,2).
#train 3 pacf acf plot
pacf_acf_plot(train3)
For train3 they appear to enter the confidence bands at lags 4 and 2.
#Fit ARMA model
#First fold RMSE
model1_arma = ARMA(train1, order=(2,2),freq='MS').fit()
pred1_arma = model1_arma.predict(start=test1.index[0], end=test1.index[-1])
RMSE1_arma=round(sqrt(mean_squared_error(test1, pred1_arma)),2)
MAPE1_arma = round(mape(test1,pred1_arma),2)
#Second fold RMSE
model2_arma = ARMA(train2, order=(3,2),freq='MS').fit()
pred2_arma = model2_arma.predict(start=test2.index[0], end=test2.index[-1])
RMSE2_arma=round(sqrt(mean_squared_error(test2, pred2_arma)),2)
MAPE2_arma = round(mape(test2, pred2_arma),2)
#Third fold RMSE
model3_arma = ARMA(train3, order=(5,2),freq='MS').fit()
pred3_arma = model3_arma.predict(start=test3.index[0], end=test3.index[-1])
RMSE3_arma=round(sqrt(mean_squared_error(test3, pred3_arma)),2)
MAPE3_arma= round(mape(test3, pred3_arma),2)
print ("RMSE1 for ARMA:", RMSE1_arma)
print ("RMSE2 for ARMA:", RMSE2_arma)
print ("RMSE3 for ARMA:", RMSE3_arma)
Overall_RMSE_arma=round((RMSE1_arma+RMSE2_arma+RMSE3_arma)/3,2)
print ("Overall RMSE for ARMA:", Overall_RMSE_arma)
print(" ")
print(" ")
print ("MAPE1 for ARMA:", MAPE1_arma)
print("MAPE2 for ARMA:", MAPE2_arma)
print("MAPE3 for ARMA:", MAPE3_arma)
Overall_MAPE_arma=round((MAPE1_arma+MAPE2_arma+MAPE3_arma)/3,2)
print ("Overall MAPE for ARMA:", Overall_MAPE_arma)
#plot CV for ARMA
#Labels and titles
plt.xlabel("Time")
plt.ylabel("CO2 Emissions in MMT")
plt.title("3 Fold Sliding Window Cross-Validation for ARMA Model")
#First fold- CV
plt.plot(train1.index, train1, label='Train1')
plt.plot(test1.index, test1, label='Test1')
plt.plot(pred1_arma.index, pred1_arma, label='ARMA prediction1')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test1.index[0], test1.index[-1], facecolor='g', alpha=0.1)
#Second fold
plt.plot(train2.index, train2, label='Train2')
plt.plot(test2.index, test2, label='Test2')
plt.plot(pred2_arma.index, pred2_arma, label='ARMA prediction2')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test2.index[0], test2.index[-1], facecolor='r', alpha=0.2)
#Third fold
plt.plot(test3.index, test3, label='Test3')
plt.plot(pred3_arma.index, pred3_arma, label='ARMA prediction3')
plt.legend(loc='best')
#Highlighting the region
plt.axvspan(test3.index[0], test3.index[-1], facecolor='b', alpha=0.3);
Observations
#make 3 dataframes with the same train folds as above
df_train1 = train1.reset_index()
df_train1.columns = ['ds','y'] # Prophet requires columns named 'ds' and 'y'
df_train2 = train2.reset_index()
df_train2.columns = ['ds','y']
df_train3 = train3.reset_index()
df_train3.columns = ['ds','y']
Fit First fold on cross validation train1
#fit prophet model on train1
m_cv1 = Prophet(interval_width=0.95,yearly_seasonality=True, changepoint_range=0.95)
m_cv1.fit(df_train1)
future_cv1 = m_cv1.make_future_dataframe(periods=26,freq='M')
prophet_pred_cv1 = m_cv1.predict(future_cv1)
# Run cross validation
from fbprophet.diagnostics import cross_validation
df_cv = cross_validation(m_cv1, initial='730 days', period='180 days', horizon = '365 days')
df_cv.head()
df_cv.tail()
# Performance metrics
from fbprophet.diagnostics import performance_metrics
df_p = performance_metrics(df_cv)
df_p.head()
#calculate RMSE
RMSE_pr1 = round(df_p['rmse'].mean(),2)
RMSE_pr1
#calculate MAPE
MAPE_pr1 = round(df_p['mape'].mean()*100,2)
MAPE_pr1
# Plot rmse
from fbprophet.plot import plot_cross_validation_metric
fig_rmse1 = plot_cross_validation_metric(df_cv, metric='rmse')
# Plot mape
from fbprophet.plot import plot_cross_validation_metric
fig_mape1 = plot_cross_validation_metric(df_cv, metric='mape')
Fit 2nd train fold on Prophet
#fit 2nd prophet model on train2
m_cv2 = Prophet(interval_width=0.95,yearly_seasonality=True, changepoint_range=0.95)
m_cv2.fit(df_train2)
future_cv2 = m_cv2.make_future_dataframe(periods=26,freq='M')
prophet_pred_cv2 = m_cv2.predict(future_cv2)
# Run cross validation
df_cv2 = cross_validation(m_cv2, initial='730 days', period='180 days', horizon = '365 days')
df_cv2.head()
df_cv2.tail()
# Performance metrics
df_p2 = performance_metrics(df_cv2)
df_p2.head()
#calculate RMSE
RMSE_pr2 = round(df_p2['rmse'].mean(),2)
RMSE_pr2
#calculate MAPE
MAPE_pr2 = round(df_p2['mape'].mean()*100,2)
MAPE_pr2
# Plot rmse
fig_rmse2 = plot_cross_validation_metric(df_cv2, metric='rmse')
# Plot mape
fig_mape2 = plot_cross_validation_metric(df_cv2, metric='mape')
Fit 3rd Train fold on Prophet
#fit prophet model on train3
m_cv3 = Prophet(interval_width=0.95,yearly_seasonality=True, changepoint_range=0.95)
m_cv3.fit(df_train3)
future_cv3 = m_cv3.make_future_dataframe(periods=26,freq='M')
prophet_pred_cv3 = m_cv3.predict(future_cv3)
# Run cross validation
df_cv3 = cross_validation(m_cv3, initial='730 days', period='180 days', horizon = '365 days')
df_cv3.head()
df_cv3.tail()
# Performance metrics
df_p3 = performance_metrics(df_cv3)
df_p3.head()
#Calculate RMSE
RMSE_pr3 = round(df_p3['rmse'].mean(),2)
RMSE_pr3
#Calculate MAPE
MAPE_pr3 = round(df_p3['mape'].mean()*100,2)
MAPE_pr3
# Plot rmse
fig_rmse3 = plot_cross_validation_metric(df_cv3, metric='rmse')
# Plot mape
fig_mape3 = plot_cross_validation_metric(df_cv3, metric='mape')
#overall outcome for all 3 folds cv for Prophet
print ("RMSE1 for Prophet:", RMSE_pr1)
print ("RMSE2 for Prophet:", RMSE_pr2)
print ("RMSE3 for Prophet:", RMSE_pr3)
Overall_rmse_pr=round((RMSE_pr1+RMSE_pr2+RMSE_pr3)/3,2)
print ("Overall RMSE for Prophet:", Overall_rmse_pr)
print(" ")
print(" ")
print ("MAPE1 for Prophet:", MAPE_pr1)
print("MAPE2 for Prophet:", MAPE_pr2)
print("MAPE3 for Prophet:", MAPE_pr3)
Overall_MAPE_pr=round((MAPE_pr1+MAPE_pr2+MAPE_pr3)/3,2)
print ("Overall MAPE for Prophet:", Overall_MAPE_pr)
Observations
resamp_month_prophet['Predicted Value'].iloc[181:194].sum()
Final Cross Validation Results
#build dataframe
index1 = ['Triple Exp Smoothing','SARIMA','ARMA','Prophet']
data1 = {'RMSE':[4.87,3.69,9.56,2.16],'MAPE':[14.6, 12.24,29.88,11.37]}
cv_results = pd.DataFrame(data=data1,index=index1)
#print df
print("Cross Validation Results")
cv_results
#plot cross validation results
cv_results.plot(kind='bar',figsize=(16,8))
plt.title('Cross Validation Results')
plt.xlabel('Models Tested')
plt.ylabel('Values')
plt.show()
Observations
- Even though MAPE is a percentage and RMSE is in MMT, the bar chart gives a rough side-by-side comparison across the models.
Accuracy Dataframe results
#Final accuracy results
index2 = ['Triple Exp Smoothing','SARIMA','ARMA','Prophet']
data2 = {'RMSE':[1.296,2.641,2.565,2.328],'MAPE':[17.511,20.176,14.676,19.762]}
accuracy = pd.DataFrame(data=data2,index=index2)
#print df
print("Accuracy Measures for all Forecast Models")
accuracy
Observations
Final Forecast for July 2016-July 2017 for all models
#Final accuracy results
index3 = ['Triple Exp Smoothing','Actual 2016-2017','Prophet','ARMA','SARIMA']
data3 = {'Total CO2 Emissions Forecast July 2016-July 2017':[613.229,571.5,571.440,562.68,524.42]}
Forecast_year = pd.DataFrame(data=data3,index=index3)
#print df
Forecast_year
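To make the comparison explicit, we can add an absolute-error column against the actual July 2016-July 2017 total of 571.5 MMT. A small self-contained sketch using the totals from the table above:

```python
import pandas as pd

index3 = ['Triple Exp Smoothing', 'Actual 2016-2017', 'Prophet', 'ARMA', 'SARIMA']
data3 = {'Total CO2 Emissions Forecast July 2016-July 2017':
         [613.229, 571.5, 571.440, 562.68, 524.42]}
Forecast_year = pd.DataFrame(data=data3, index=index3)

# Absolute deviation of each total from the actual July-to-July figure
actual = Forecast_year.loc['Actual 2016-2017'].iloc[0]
Forecast_year['Abs Error (MMT)'] = (
    Forecast_year['Total CO2 Emissions Forecast July 2016-July 2017'] - actual
).abs().round(3)
print(Forecast_year)

# The model whose annual total lands closest to the actual
models = Forecast_year.drop('Actual 2016-2017')
print(models['Abs Error (MMT)'].idxmin())  # → Prophet
```

This confirms the observation above: the Prophet total misses the historical figure by only about 0.06 MMT.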
Plot of all forecasts compared to train and test data
#Total forecast for July 2016-2017 comparison
df_test['Value'].plot(legend=True,label='TEST',figsize=(16,8))
exp_smooth.plot(legend=True,label='Exponential Smoothing Prediction')
forecasted_ARMA.plot(legend=True,label='ARMA Prediction')
sarima_fcast.plot(legend=True,label='SARIMA Prediction')
prophet_co2.plot(legend=True,label='Prophet Prediction')
plt.title('Comparison of All Forecast Models: Actual vs. Predicted for July 2016- July 2017 Forecast')
plt.xlabel('Years')
plt.ylabel("CO2 Emissions in MMT")
plt.show();
Observations
a) July 2014-2015: 538.298
b) July 2015-2016: 607.844
c) July 2016-2017: Our forecast is 571.440, actual (historical EIA.gov) is 571.537.
d) July 2017-2018: 609.577.